Tool Use


Watch: Cow astonishes scientists with rare use of tools

BBC News

Scientists are rethinking what cattle are capable of after an Austrian cow named Veronika was found to use tools with impressive skill. The discovery, reported by researchers in Vienna, suggests cows may have far greater cognitive abilities than previously assumed. Veronika, a cow living in a mountain village in the Austrian countryside, has spent years perfecting the art of scratching herself using sticks, rakes, and brooms. Word of her behaviour eventually reached animal intelligence specialists in Vienna, who found Veronika used both ends of the same object for different tasks. When it was her back or another tough area that warranted a good scratch, she would use the bristle end of a broom.


Veronika the Cow shocks scientists by using a tool

Popular Science

The 13-year-old cow is crushing stereotypes of bovine intelligence. The smart animal club continues to add new members, and the newest might surprise you. A pet cow in Austria named Veronika picks up sticks with her mouth and uses them to scratch herself, which a team at the University of Veterinary Medicine, Vienna believes is tool use. Veronika and her ground-breaking scratching are detailed in a study published today in . "The findings highlight how assumptions about livestock intelligence may reflect gaps in observation rather than genuine cognitive limits," Alice Auersperg, a study co-author and cognitive biologist at the university, said in a statement.

  Country: Europe > Austria > Vienna (0.25)
  Genre: Research Report > New Finding (0.91)
  Industry: Food & Agriculture > Agriculture (0.31)

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Guo, Zirun, Hong, Minjie, Zhang, Feng, Jia, Kai, Jin, Tao

arXiv.org Artificial Intelligence

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
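The core idea of a code-as-tool interface can be illustrated with a minimal sketch: rather than selecting from a fixed tool registry, the model emits a short Python snippet that is executed against the current image state. The names below (`run_tool_code`, the `img` binding, `rotate90`) are illustrative assumptions, not the paper's actual API, and a pure-Python pixel grid stands in for a real image.

```python
# Minimal sketch of a code-as-tool loop: the model writes code, the
# framework executes it against the current image, and the transformed
# image flows back into the next reasoning turn.

def rotate90(img):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def run_tool_code(code: str, img):
    """Execute model-generated code with the image bound to `img`."""
    scope = {"img": img, "rotate90": rotate90}
    exec(code, scope)  # a real system would sandbox this call
    return scope["img"]

# Suppose the model judges the input to be sideways and emits:
snippet = "img = rotate90(img)"
img = [[1, 2, 3],
       [4, 5, 6]]          # a 2x3 stand-in "image"
img = run_tool_code(snippet, img)
print(img)                 # [[4, 1], [5, 2], [6, 3]]
```

Because the interface is arbitrary code rather than an enumerated tool list, any composition of operations the model can express is immediately available, which is what makes the approach scalable.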


Wolf uses tool in stunning video

Popular Science

The gray wolf reeled in a crab trap with a rope, sparking a debate among biologists. Some 300 miles north of Vancouver, nestled among the rocky bays and forests of the Haíɫzaqv Nation, a wily gray wolf helps itself to a snack. On its own, this isn't remarkable and happens all the time. But a wild wolf swimming to a buoy, reeling it in, and then pulling an underwater trap to shore before eating the bait?

  Country: North America > United States > Idaho (0.05)
  Genre: Research Report > New Finding (0.51)
  Industry: Retail (0.31)

From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

Bayat, Farima Fatahi, Pezeshkpour, Pouya, Hruschka, Estevam

arXiv.org Artificial Intelligence

Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Codes and data are available at: https://github.com/megagonlabs/TIM.
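The pairwise win-rate metric quoted above can be sketched in a few lines: a judge labels, per problem, which solution exhibits the more coherent reasoning, and the aggregate rates are compared. The labels and function name below are assumptions for illustration, not the paper's evaluation code.

```python
# Aggregate pairwise judgments of reasoning quality between a
# tool-augmented model ("tool") and its non-tool counterpart ("non_tool").

def win_rates(judgments):
    """judgments: one of 'tool', 'non_tool', or 'tie' per problem."""
    n = len(judgments)
    tool = judgments.count("tool") / n
    non_tool = judgments.count("non_tool") / n
    return tool, non_tool

judgments = ["non_tool", "non_tool", "tool", "tie", "non_tool"]
tool, non_tool = win_rates(judgments)
print(f"tool wins {tool:.0%}, non-tool wins {non_tool:.0%}")
```

The TIM signature is exactly this asymmetry: the non-tool model winning the reasoning comparison even while the tool-augmented model posts higher final-answer accuracy.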


Why are most people right-handed?

Popular Science

A mix of biology, environment, and evolution helps explain our rightie-dominated world. Roughly 85 to 90 percent of people are right-handed, while just 10 to 15 percent are left-handed, and a small percentage are ambidextrous.


Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling

Wang, Qi, Zhang, Hongzhi, Fu, Jia, Fu, Kai, Liu, Yahui, Zhang, Tinghai, Sun, Chenxi, Jiang, Gangwei, Tang, Jingyi, Ji, Xingguang, Yue, Yang, Zhang, Jingyuan, Zhang, Fuzheng, Gai, Kun, Zhou, Guorui

arXiv.org Artificial Intelligence

Despite the proliferation of powerful agentic models, the lack of critical post-training details hinders the development of strong counterparts in the open-source community. In this study, we present a comprehensive and fully open-source pipeline for training a high-performance agentic model for interacting with external tools and environments, named Klear-Qwen3-AgentForge, starting from the Qwen3-8B base model. We design effective supervised fine-tuning (SFT) with synthetic data followed by multi-turn reinforcement learning (RL) to unlock the potential for multiple diverse agentic tasks. We perform exclusive experiments on various agentic benchmarks in both tool use and coding domains. Klear-Qwen3-AgentForge-8B achieves state-of-the-art performance among LLMs of similar size and remains competitive with significantly larger models.


A Tutorial on Cognitive Biases in Agentic AI-Driven 6G Autonomous Networks

Chergui, Hatim, Rezazadeh, Farhad, Debbah, Merouane, Verikoukis, Christos

arXiv.org Artificial Intelligence

The path to higher network autonomy in 6G lies beyond the mere optimization of key performance indicators (KPIs). While KPIs have enabled automation gains under TM Forum Levels 1--3, they remain numerical abstractions that act only as proxies for the real essence of communication networks: seamless connectivity, fairness, adaptability, and resilience. True autonomy requires perceiving and reasoning over the network environment as it is. Such progress can be achieved through \emph{agentic AI}, where large language model (LLM)-powered agents perceive multimodal telemetry, reason with memory, negotiate across domains, and act via APIs to achieve multi-objective goals. However, deploying such agents introduces the challenge of cognitive biases inherited from human design, which can distort reasoning, negotiation, tool use, and actuation. Between neuroscience and AI, this paper provides a tutorial on a selection of well-known biases, including their taxonomy, definition, mathematical formulation, emergence in telecom systems and the commonly impacted agentic components. The tutorial also presents various mitigation strategies tailored to each type of bias. The article finally provides two practical use-cases, which tackle the emergence, impact and mitigation gain of some famous biases in 6G inter-slice and cross-domain management. In particular, anchor randomization, temporal decay and inflection bonus techniques are introduced to specifically address anchoring, temporal and confirmation biases. This prevents agents from sticking to the initial high resource allocation proposal, or to decisions simply because they are recent and/or confirm a prior hypothesis. By grounding decisions in a richer and fairer set of past experiences, the quality and boldness of the agentic agreements in the second use-case, for instance, lead to $5\times$ lower latency and around $40\%$ higher energy saving.
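Two of the mitigations named above admit simple forms, sketched here under assumed parameterizations (the decay rate, jitter spread, and function names are illustrative, not the paper's exact formulations): temporal decay down-weights older experiences smoothly rather than discarding them, countering temporal bias, while anchor randomization perturbs the initial resource proposal so negotiation does not lock onto it, countering anchoring bias.

```python
import math
import random

def decayed_weights(ages_hours, lam=0.1):
    """Exponential temporal decay w_i = exp(-lam * age): recent
    experiences weigh most, but old ones never drop to zero, so the
    agent still grounds decisions in a richer history."""
    return [math.exp(-lam * a) for a in ages_hours]

def randomized_anchor(proposal, spread=0.2, rng=random.Random(0)):
    """Jitter the initial allocation proposal by up to +/- spread so
    the starting point is no longer a fixed anchoring target."""
    return proposal * (1 + rng.uniform(-spread, spread))

print(decayed_weights([0, 12, 48]))  # monotonically decreasing weights
print(randomized_anchor(100.0))      # within [80.0, 120.0]
```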


MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards

Choi, ChangSu, Song, Hoyun, Kim, Dongyeon, Jung, WooHyeon, Cho, Minkyung, Park, Sunjin, Bae, NohHyeob, Yu, Seona, Lim, KyungTae

arXiv.org Artificial Intelligence

Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
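The dense-versus-sparse reward distinction can be made concrete with a sketch: instead of a 0/1 success signal, the student's tool-call trajectory also earns partial credit for agreeing with the teacher's reference trajectory. The overlap metric, weights, and function name here are illustrative assumptions, not MENTOR's exact reward.

```python
# Composite teacher-guided reward: a dense step-matching term plus the
# usual sparse outcome term, so a student that explores a near-correct
# trajectory still receives a useful gradient signal.

def teacher_guided_reward(student_calls, teacher_calls,
                          task_solved, w_match=0.5, w_final=0.5):
    """Blend positional agreement with the teacher's tool calls and
    the final task outcome into one scalar reward in [0, 1]."""
    matched = sum(1 for s, t in zip(student_calls, teacher_calls) if s == t)
    match_score = matched / max(len(teacher_calls), 1)
    return w_match * match_score + w_final * float(task_solved)

teacher = ["search", "parse", "calculate"]
student = ["search", "calculate"]  # skipped a step, failed the task
print(teacher_guided_reward(student, teacher, task_solved=False))
```

Under a purely sparse reward, the failed rollout above would score zero; the dense term is what gives the small model fine-grained guidance while still leaving room for exploration beyond imitation.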


OrchDAG: Complex Tool Orchestration in Multi-Turn Interactions with Plan DAGs

Lu, Yifu, Liu, Shengjie, Dong, Li

arXiv.org Artificial Intelligence

Agentic tool use has gained traction with the rise of agentic tool calling, yet most existing work overlooks the complexity of multi-turn tool interactions. We introduce OrchDAG, a synthetic data generation pipeline that models tool execution as directed acyclic graphs (DAGs) with controllable complexity. Using this dataset, we benchmark model performance and propose a graph-based reward to enhance RLVR training. Experiments show that the dataset presents a challenging but solvable benchmark, and the proposed reward is effective when combined with GRPO-style algorithms, highlighting the importance of leveraging topological structure and data complexity in multi-turn tool use.
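Modeling a tool plan as a DAG can be sketched directly with the standard library: nodes are tool calls, edges are data dependencies, and any valid execution schedule is a topological order of the graph. The tool names and the `respects_dag` scoring helper below are made up for illustration; this is not OrchDAG's pipeline or reward.

```python
from graphlib import TopologicalSorter

# node -> set of prerequisite nodes (must run before it)
plan = {
    "fetch_user": set(),
    "fetch_orders": {"fetch_user"},
    "fetch_prices": set(),
    "total_spend": {"fetch_orders", "fetch_prices"},
}

# Any topological order is a valid multi-turn execution schedule.
order = list(TopologicalSorter(plan).static_order())
print(order)

def respects_dag(seq, dag):
    """Check that a proposed call sequence honors every dependency edge,
    the kind of structural signal a graph-based reward can score."""
    pos = {t: i for i, t in enumerate(seq)}
    return all(pos[p] < pos[n] for n, preds in dag.items() for p in preds)

print(respects_dag(order, plan))  # True: schedule honors all edges
```

A graph-based reward of this shape credits a model for respecting the plan's topology rather than only for matching one flattened call list, which is why it pairs naturally with GRPO-style training on multi-turn data.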